16 research outputs found

    Finding Street Gang Members on Twitter

    Most street gang members use Twitter to intimidate others, to present outrageous images and statements to the world, and to share recent illegal activities. Their tweets may thus be useful to law enforcement agencies to discover clues about recent crimes or to anticipate ones that may occur. Finding these posts, however, requires a method to discover gang member Twitter profiles. This is a challenging task since gang members represent a very small population of the 320 million Twitter users. This paper studies the problem of automatically finding gang members on Twitter. It outlines a process to curate one of the largest sets of verifiable gang member profiles that have ever been studied. A review of these profiles establishes differences in the language, images, YouTube links, and emojis gang members use compared to the rest of the Twitter population. Features from this review are used to train a series of supervised classifiers. Our classifier achieves a promising F1 score with a low false positive rate. Comment: 8 pages, 9 figures, 2 tables. Published as a full paper at the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2016)

    Knowledge will Propel Machine Understanding of Content: Extrapolating from Current Examples

    Machine Learning has been a big success story during the AI resurgence. One particular standout success relates to learning from a massive amount of data. In spite of early assertions of the unreasonable effectiveness of data, there is increasing recognition of the value of utilizing knowledge whenever it is available or can be created purposefully. In this paper, we discuss the indispensable role of knowledge for deeper understanding of content where (i) large amounts of training data are unavailable, (ii) the objects to be recognized are complex (e.g., implicit entities and highly subjective content), and (iii) applications need to use complementary or related data in multiple modalities/media. What brings us to the cusp of rapid progress is our ability to (a) create relevant and reliable knowledge and (b) carefully exploit knowledge to enhance ML/NLP techniques. Using diverse examples, we seek to foretell unprecedented progress in our ability for deeper understanding and exploitation of multimodal data and continued incorporation of knowledge in learning techniques. Comment: Pre-print of the paper accepted at 2017 IEEE/WIC/ACM International Conference on Web Intelligence (WI). arXiv admin note: substantial text overlap with arXiv:1610.0770

    A Semantics-Based Measure of Emoji Similarity

    Emoji have grown to become one of the most important forms of communication on the web. With their widespread use, measuring the similarity of emoji has become an important problem for contemporary text processing since it lies at the heart of sentiment analysis, search, and interface design tasks. This paper presents a comprehensive analysis of the semantic similarity of emoji through embedding models that are learned over machine-readable emoji meanings in the EmojiNet knowledge base. Using emoji descriptions, emoji sense labels and emoji sense definitions, and with different training corpora obtained from Twitter and Google News, we develop and test multiple embedding models to measure emoji similarity. To evaluate our work, we create a new dataset called EmoSim508, which assigns human-annotated semantic similarity scores to a set of 508 carefully selected emoji pairs. After validation with EmoSim508, we present a real-world use-case of our emoji embedding models using a sentiment analysis task and show that our models outperform the previous best-performing emoji embedding model on this task. The EmoSim508 dataset and our emoji embedding models are publicly released with this paper and can be downloaded from http://emojinet.knoesis.org/. Comment: This paper is accepted at Web Intelligence 2017 as a full paper. In 2017 IEEE/WIC/ACM International Conference on Web Intelligence (WI). Leipzig, Germany: ACM, 201

    Finding Street Gang Member Profiles on Twitter

    The crime and violence street gangs introduce into neighborhoods is a growing epidemic in cities around the world. Today, over 1.4 million people, belonging to more than 33,000 gangs, are active in the United States, of which 88% identify themselves as being members of a street gang. With the recent popularity of social media, street gang members have established online presences coinciding with their physical occupation of neighborhoods. Recent studies report that approximately 45% of gang members participate in online offending activities such as threatening, harassing individuals, posting violent videos or attacking someone on the street for something they said online in social media platforms. Thus, their social media posts may be useful to social workers and law enforcement agencies to discover clues about recent crimes or to anticipate ones that may occur in a community. Finding these posts, however, requires a method to discover gang member social media profiles. This is a challenging task since gang members represent a very small population compared to the active social media user base. This thesis studies the problem of automatically identifying street gang member profiles on Twitter, which is a popular social media platform that is commonly used by street gang members to promote their online gang-related activities. It outlines a process to curate one of the largest sets of verifiable gang member Twitter profiles that have ever been studied. A review of these profiles establishes differences in the language, profile and cover images, YouTube links, and emoji shared on Twitter by gang members compared to the rest of the Twitter population. 
Beyond the earlier efforts in Twitter profile identification that utilize features derived from the profile and tweet text, this thesis uses additional heterogeneous sets of features from the emoji usage, profile images, and links to YouTube videos reflecting gang-related music culture towards solving the gang member profile identification problem. Features from this review are used to train a series of supervised machine learning classifiers and they are further improved upon by using word embeddings learned over a large corpus of tweets. Experimental results demonstrate that heterogeneous features enabled our classifiers to achieve low false positive rates and promising F1-scores.


    EmojiNet: An Open Service and API for Emoji Sense Discovery

    Emoji have grown to become one of the most important forms of communication on the web. With their widespread use, measuring the similarity of emoji has become an important problem for contemporary text processing since it lies at the heart of sentiment analysis, search, and interface design tasks. This paper presents a comprehensive analysis of the semantic similarity of emoji through embedding models that are learned over machine-readable emoji meanings in the EmojiNet knowledge base. Using emoji descriptions, emoji sense labels and emoji sense definitions, and with different training corpora obtained from Twitter and Google News, we develop and test multiple embedding models to measure emoji similarity. To evaluate our work, we create a new dataset called EmoSim508, which assigns human-annotated semantic similarity scores to a set of 508 carefully selected emoji pairs. After validation with EmoSim508, we present a real-world use-case of our emoji embedding models using a sentiment analysis task and show that our models outperform the previous best-performing emoji embedding model on this task. The EmoSim508 dataset and our emoji embedding models are publicly released with this paper and can be downloaded from http://emojinet.knoesis.org/.

    A Quality Type-aware Annotated Corpus and Lexicon for Harassment Research

    A quality annotated corpus is essential to research. Despite the recent focus of the Web science community on cyberbullying research, the community lacks standard benchmarks. This paper provides both a quality annotated corpus and an offensive words lexicon capturing different types of harassment content: (i) sexual, (ii) racial, (iii) appearance-related, (iv) intellectual, and (v) political. We first crawled data from Twitter using this content-tailored offensive lexicon. As mere presence of an offensive word is not a reliable indicator of harassment, human judges annotated tweets for the presence of harassment. Our corpus consists of 25,000 annotated tweets for the five types of harassment content and is available on the Git repository.
